Attention Mechanism In Transformer

Gaurav
February 10, 2024

Attention Mechanism:

The attention mechanism is designed to focus on different parts of the input data, depending on the context. In the context of the Transformer model, the attention mechanism allows the model to focus on different words in the input sequence when producing an output sequence. The strength of the attention is determined by a score, which is calculated using a query, key, and value.

(Figure: Attention Mechanism)

1. The Basic Attention Equation:

Given a Query (Q), a Key (K), and a Value (V), the attention mechanism computes a weighted sum of the values, where the weight assigned to each value is determined by the query and the corresponding key.

The attention score for a query Q and a key K is calculated as:

\text{score}(Q, K) = Q \cdot K^T

This score is then passed through a softmax function to get the attention weights:

\text{softmax}\left(\frac{\text{score}(Q, K)}{\sqrt{d_k}}\right)

where d_k is the dimension of the key vectors (this scaling factor helps stabilize the gradients).

Finally, the output is calculated as a weighted sum of the values:

\text{output} = \text{attention weights} \times V
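
To make the equation concrete, here is a minimal NumPy sketch of the basic attention computation; the shapes and random inputs are illustrative assumptions, not values taken from a real model.

```python
# Minimal sketch of the basic attention equation (illustrative shapes only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]                          # dimension of the key vectors
    scores = Q @ K.T                           # score(Q, K) = Q · K^T
    weights = softmax(scores / np.sqrt(d_k))   # scaled scores -> attention weights
    return weights @ V                         # weighted sum of the values

# One query attending over three key/value pairs, each of dimension d_k = 4.
Q = np.random.randn(1, 4)
K = np.random.randn(3, 4)
V = np.random.randn(3, 4)
print(attention(Q, K, V).shape)                # -> (1, 4)
```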

2. Matrix Calculation of Self-Attention:

In practice, we don’t calculate attention for a single word, but rather for a set of words (i.e., a sequence). To do this efficiently, we use matrix operations.

Step 1: Calculate Query, Key, Value matrices

Given an input matrix X (which consists of the embeddings of all words in a sequence) and the weight matrices W_Q, W_K, and W_V that we've trained:

Q = X \times W_Q
K = X \times W_K
V = X \times W_V
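
As a sketch of Step 1, these projections are just three matrix multiplications; the sequence length, embedding size, and random weights below are illustrative assumptions (in a real model the weight matrices are learned).

```python
# Minimal sketch of Step 1: project the embeddings X into Q, K, and V.
import numpy as np

seq_len, d_model, d_k = 5, 8, 4         # illustrative sizes, not from the article
X = np.random.randn(seq_len, d_model)   # embeddings of all words in the sequence

# Random stand-ins for the trained weight matrices W_Q, W_K, W_V.
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q = X @ W_Q   # (seq_len, d_k)
K = X @ W_K   # (seq_len, d_k)
V = X @ W_V   # (seq_len, d_k)
```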

Steps 2 to 5: Compute the Output of the Self-Attention Layer

Given the matrices Q, K, and V that we've just computed:

  2. Calculate the dot product of Q and K^T to get the score matrix:
\text{Score} = Q \times K^T
  3. Divide the score matrix by the square root of the depth d_k:
\text{Scaled Score} = \frac{\text{Score}}{\sqrt{d_k}}
  4. Apply the softmax function to the scaled score matrix:
\text{Attention Weights} = \text{softmax}(\text{Scaled Score})
  5. Multiply the attention weights by the value matrix V:
\text{Output} = \text{Attention Weights} \times V

This output is the result of the self-attention mechanism for the input sequence.
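
The steps above can be sketched directly in NumPy; the random Q, K, and V below stand in for the matrices produced in Step 1, with illustrative sizes.

```python
# Minimal sketch of Steps 2-5: from Q, K, V to the self-attention output.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_k = 5, 4                           # illustrative sizes
Q = np.random.randn(seq_len, d_k)             # stand-ins for the Step 1 results
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

scores = Q @ K.T                              # Score = Q x K^T        -> (seq_len, seq_len)
scaled = scores / np.sqrt(d_k)                # divide by sqrt(d_k)
weights = softmax(scaled, axis=-1)            # softmax along each row
output = weights @ V                          # Output = weights x V   -> (seq_len, d_k)
print(output.shape)                           # -> (5, 4)
```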


Multi-Head Attention:

In multi-head attention, the idea is to have multiple sets of Query, Key, Value weight matrices. Each of these sets will generate different attention scores and outputs. By doing this, the model can focus on different subspaces of the data.

(Figure: Multi-Head Attention)

Let's denote the number of heads as h.

  1. For each head i, we have its own weight matrices: W_{Q_i}, W_{K_i}, and W_{V_i}.

  2. For each head i (from 1 to h), compute the Query, Key, and Value matrices just as in single-head attention:

Q_i = X \times W_{Q_i}
K_i = X \times W_{K_i}
V_i = X \times W_{V_i}
  3. Using the Q_i, K_i, and V_i matrices, we calculate the output for each head:
\text{Score}_i = Q_i \times K_i^T
\text{Scaled Score}_i = \frac{\text{Score}_i}{\sqrt{d_k}}
\text{Attention Weights}_i = \text{softmax}(\text{Scaled Score}_i)
\text{Output}_i = \text{Attention Weights}_i \times V_i

Now, after obtaining the output for each head, we need to combine these outputs to get a single unified output.

  4. Concatenation & Linear Transformation: The outputs from all heads are concatenated and then linearly transformed to produce the final output:
\text{Concatenated Output} = \text{concat}(\text{Output}_1, \text{Output}_2, ..., \text{Output}_h)
\text{Final Output} = \text{Concatenated Output} \times W_O

where W_O is another trained weight matrix.

This multi-head mechanism allows the Transformer to focus on different positions with different subspace representations, making it more expressive and capable of capturing various types of relationships in the data.
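
Putting the pieces together, here is a minimal multi-head sketch, assuming h heads that each project the d_model-dimensional input down to d_k = d_model / h dimensions; all sizes and random weights are illustrative assumptions, not a definitive implementation.

```python
# Minimal multi-head attention sketch (random weights stand in for trained ones).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V

seq_len, d_model, h = 5, 8, 2
d_k = d_model // h
X = np.random.randn(seq_len, d_model)

head_outputs = []
for i in range(h):
    # Each head gets its own W_{Q_i}, W_{K_i}, W_{V_i}.
    W_Qi = np.random.randn(d_model, d_k)
    W_Ki = np.random.randn(d_model, d_k)
    W_Vi = np.random.randn(d_model, d_k)
    head_outputs.append(attention(X @ W_Qi, X @ W_Ki, X @ W_Vi))  # Output_i

concatenated = np.concatenate(head_outputs, axis=-1)   # (seq_len, h * d_k)
W_O = np.random.randn(h * d_k, d_model)                # final projection W_O
final_output = concatenated @ W_O                      # (seq_len, d_model)
print(final_output.shape)                              # -> (5, 8)
```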

